EDA
In this section, we present our primary dataset and our supplementary datasets to give a fuller picture of the data we are working with.
The goal is to explore how we might use these data, together with strategies and techniques identified in our literature review, to profile syndemic relationships for type II diabetes.
Packages
Demographic Data
Because our research goal is to evaluate how social and demographic factors interact with diabetes in a syndemic relationship, it is important to understand the demographic breakdown of the group studied by the National Health and Nutrition Examination Survey (NHANES).
The demographic data set for the study includes 15560 observations of 29 variables including information on race, gender, family income, education level, and language spoken. Names and summaries of each of the variables are shown below.
Missing Values
The data set has 682 missing values for the age variable and 2201 missing values for the ratio_family_income_poverty variable.
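Per-column missing-value counts like these can be obtained with base R. The following is a minimal sketch on a small made-up data frame: only the column names `age` and `ratio_family_income_poverty` come from the text, and the values are illustrative.

```r
# Toy stand-in for the demographics data; values are illustrative only.
demo_toy <- data.frame(
  age = c(34, NA, 61, 45, NA),
  ratio_family_income_poverty = c(1.2, 3.4, NA, NA, NA)
)

# Count missing values in every column.
na_counts <- colSums(is.na(demo_toy))
print(na_counts)
```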
Distribution of Continuous Variables
Note:
The Department of Health and Human Services (HHS) poverty guidelines were used as the poverty measure to calculate this ratio. So, the ratio was calculated as:
Ratio = (Total Annual Income)/(Poverty Guideline specific to each year)
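As a worked illustration of the formula above, the sketch below computes the ratio for hypothetical incomes and a hypothetical guideline amount. The helper name `income_poverty_ratio` and all figures are assumptions for illustration; none are taken from the HHS guidelines or from NHANES.

```r
# Hypothetical helper: ratio of annual income to the poverty guideline.
income_poverty_ratio <- function(total_annual_income, poverty_guideline) {
  total_annual_income / poverty_guideline
}

# Illustrative values only (not actual HHS guideline amounts).
ratio_example <- income_poverty_ratio(c(15000, 30000, 60000), 30000)

# Values below 1 indicate income under the guideline.
print(ratio_example)
```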
Distribution of Categorical Variables
# Keep only respondents with a recorded education level
demographics <- demographics[!is.na(demographics$education_level), ]
Gender and Education Stratified by Race
To understand how confounding variables may affect our analysis, it is important to compare the distributions of various factors such as gender and education level by other demographic factors such as race.
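One way to make such a comparison is to convert a cross-tabulation to within-group proportions, so that groups of different sizes can be compared directly. The sketch below uses made-up data; the column names `race` and `gender` and the category labels are assumptions, not the coding used in the demographics file.

```r
# Toy data; category labels are illustrative only.
strat_toy <- data.frame(
  race   = c("A", "A", "A", "B", "B", "B", "B", "A"),
  gender = c("Female", "Male", "Female", "Male",
             "Male", "Female", "Male", "Female")
)

# Counts of gender within each race group.
counts_by_race <- table(strat_toy$race, strat_toy$gender)

# Row-wise proportions: each race group sums to 1.
props_by_race <- prop.table(counts_by_race, margin = 1)
print(round(props_by_race, 2))
```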
Diabetes Data
The diabetes data set from the National Health and Nutrition Examination Survey contains information on the diagnosis and progression of disease for each participant in the study. It contains 28 variables, including when participants were diagnosed, whether they are on insulin, and how frequently they see a doctor. Names and summaries of each of the variables are shown below.
Missing Values
There are missing values in the age_informed, insulin_length, num_dr_visits_past_year, and how_often_glucose_check variables. These missing values are likely for participants who have not been informed of a diabetes diagnosis.
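One way to check this explanation is to cross-tabulate missingness against diagnosis status. The sketch below does so on made-up data; `age_informed` appears in the text, while `told_have_diabetes` is a hypothetical column name standing in for whichever variable records the diagnosis.

```r
# Toy data; values are illustrative only.
diab_toy <- data.frame(
  told_have_diabetes = c("Yes", "No", "No", "Yes", "No"),  # hypothetical column
  age_informed       = c(52, NA, NA, 47, NA)
)

# Rows: diagnosis status; columns: whether age_informed is missing.
missing_by_dx <- table(
  diagnosis = diab_toy$told_have_diabetes,
  age_informed_missing = is.na(diab_toy$age_informed)
)
print(missing_by_dx)
```

If the explanation holds, nearly all missing values should fall in the row for participants without a diagnosis.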
Distribution of Diagnostic Variables
Health and Nutrition Data
The health and nutritional behavior data detail participants' food choices, such as breastfeeding and other childhood feeding practices, frequency of meals prepared away from home, frequency of meals from fast food or pizza places, use of convenience foods, and knowledge of the MyPlate program. Names and summaries of variables are shown below. The data represent 15560 individuals with 46 different variables observed.
Column Names:
1. respondent_sequence_num
2. ever_breastfed_or_fed_breastmilk
3. age_stopped_breastfeeding_days
4. diet_healthiness
5. community_government_meals_delivered
6. eat_meals_at_community_senior_center
7. attend_kindergarten_thru_high_school
8. school_serves_school_lunches
9. school_serves_complete_breakfast_daily
10. summer_program_meal_free_reduced_price
11. meals_not_home_prepared_count
12. meals_from_fast_food_or_pizza_place_count
13. ready_to_eat_foods_past_30_days
14. frozen_meals_pizza_past_30_days
Data Types & Missing Values
Breastfeeding and Weaning
Table of respondents fed breast milk or breastfed:
Value Frequency Percentage
1 Yes 2066 78.73476
2 No 558 21.26524
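The percentage column in a table like this follows directly from the frequencies. A minimal sketch using the two counts reported above (the object names are illustrative):

```r
# Counts taken from the table above.
breastfed_counts <- c(Yes = 2066, No = 558)

# Percentage of each response among all non-missing responses.
breastfed_pct <- 100 * breastfed_counts / sum(breastfed_counts)

breastfed_table <- data.frame(
  Value = names(breastfed_counts),
  Frequency = as.vector(breastfed_counts),
  Percentage = as.vector(breastfed_pct)
)
print(breastfed_table)
```

Deriving the percentages from the counts in the same step avoids the two columns drifting out of sync when the code is reused for another variable.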
Summary statistics for age stopped breastfeeding (days):
Statistic Value
Mean 198.6769
Median 121
SD 218.0595
Min 5.397605e-79
Max 1095
The minimum of 5.397605e-79 is effectively zero; a value of this magnitude is most likely a floating-point artifact of reading the XPT file rather than a recorded age.
Nutritional Practices
Education
Table of respondents who attended kindergarten through high school:
Value Frequency Percentage
1 Yes 3849 83.63755
2 No 753 16.36245
Laboratory Data
The laboratory data come from 43 XPT files downloaded from the NHANES website. Because so many files are combined, the cleaned dataset contains 337 columns. Many of these columns are strongly correlated with one another, since some report the same measurement in different units. Given the number of files and the number of variables in each file, we did not remove these highly correlated columns manually. In addition, combining the files on the common Respondent Sequence ID number leaves many missing values in each row, most likely because not every respondent has a record in every file.
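The combining step described above can be sketched in base R as follows. This is an illustration under assumptions, not the project's actual pipeline: the three toy data frames stand in for files that have already been read, a full outer join is assumed, and the measurement values are made up. The key name `respondent_sequence_num` follows the column naming used elsewhere in this report.

```r
# Toy stand-ins for three laboratory files; values are illustrative only.
lab_a <- data.frame(respondent_sequence_num = c(1, 2, 3),
                    albumin = c(5.1, 7.3, 4.8))
lab_b <- data.frame(respondent_sequence_num = c(2, 3, 4),
                    creatinine = c(0.9, 1.1, 0.8))
lab_c <- data.frame(respondent_sequence_num = c(1, 4),
                    hemoglobin = c(13.5, 14.2))

# Full outer join on the respondent ID: a respondent absent from a file
# gets NA in that file's columns.
lab_all <- Reduce(
  function(x, y) merge(x, y, by = "respondent_sequence_num", all = TRUE),
  list(lab_a, lab_b, lab_c)
)
print(lab_all)
```

With `all = TRUE`, every respondent who appears in any file is kept, which is why the merged table accumulates missing values as more files are added.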
The cleaning process removed rows in which every column except the first is NaN, as well as columns that contain only one unique value. Below is a summary of the dataset, along with visualizations of a few of the many variables we will consider in this project.
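The two cleaning rules can be sketched in base R as below. The data frame is made up, and treating missing values as not counting toward a column's unique values is an assumption of this sketch rather than something stated in the text.

```r
# Toy data; values are illustrative only.
lab_toy <- data.frame(
  respondent_sequence_num = c(1, 2, 3, 4),
  triglyceride  = c(110, NA, 95, NA),
  cholesterol   = c(180, NA, 210, 190),
  constant_flag = c(1, NA, 1, 1)  # only one unique non-missing value
)

# Rule 1: drop rows where every column except the first (the ID) is NA.
all_na_row <- rowSums(!is.na(lab_toy[, -1, drop = FALSE])) == 0
lab_clean <- lab_toy[!all_na_row, , drop = FALSE]

# Rule 2: drop columns with only one unique non-missing value.
keep_col <- vapply(
  lab_clean,
  function(x) length(unique(x[!is.na(x)])) > 1,
  logical(1)
)
lab_clean <- lab_clean[, keep_col, drop = FALSE]
print(lab_clean)
```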
Albumin in Urine (ug/mL) Testing
Creatinine (mg/dL) Testing
Arsenic Total (ug/L) Testing
Triglyceride (mg/dL) Testing
Total Cholesterol (mg/dL) Testing
Hemoglobin (g/dL) Testing
Questionnaire Data
Alcohol Data
General Alcohol Consumption
The majority of the survey population has had alcohol at least once in their life.
How Much Alcohol Consumed Per Day
Individuals who had an average of 1-3 drinks per day over the last year make up around 80% of the data. Individuals who reported having 4 or more drinks per day make up the other 20%, with 4-6 drinks accounting for 10% and 7 or more for the remaining 10%.
Depression Data
For every question in the depression questionnaire, more than half of respondents answered “not at all”. The “feeling tired or having little energy” and “trouble sleeping or sleeping too much” questions saw higher proportions of “several days” and “more than half the days” responses. After these, “poor appetite or overeating” had the next largest share of responses other than “not at all”, and the remaining questions show broadly similar distributions.
Health Insurance Data
# A tibble: 4 × 3
`Covered by Insurance?` Count Proportion
<chr> <int> <dbl>
1 Yes 13671 0.879
2 No 1852 0.119
3 Don't know 29 0.00186
4 Refused 8 0.000514
Around 87.9% of the respondents were covered by insurance, and 11.9% were not.
# A tibble: 7 × 3
`Insurance Type` No Yes
<chr> <int> <int>
1 covered_by_chip 15389 171
2 covered_by_medi_gap 15462 98
3 covered_by_medicaid 11381 4179
4 covered_by_medicare 12968 2592
5 covered_by_other_government_insurance 14552 1008
6 covered_by_private_insurance 8457 7103
7 covered_by_state_sponsored_health_plan 14623 937
The most common type of insurance was a private insurance plan, followed by Medicaid, Medicare, and other government insurance.
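This ranking can be reproduced from the "Yes" counts in the table above. A minimal sketch, with the counts copied from the table and 15560 used as the number of respondents (each row's "No" and "Yes" counts sum to this total):

```r
# "Yes" counts taken from the insurance-type table.
yes_counts <- c(
  covered_by_chip = 171,
  covered_by_medi_gap = 98,
  covered_by_medicaid = 4179,
  covered_by_medicare = 2592,
  covered_by_other_government_insurance = 1008,
  covered_by_private_insurance = 7103,
  covered_by_state_sponsored_health_plan = 937
)

# Share of all respondents, sorted from most to least common.
coverage_share <- sort(yes_counts / 15560, decreasing = TRUE)
print(round(coverage_share, 3))
```

The shares need not sum to 1, since a respondent may hold more than one type of coverage or none at all.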
Access to Healthcare and Hospital Usage Data
Respondents generally reported being in excellent or good health.
Most respondents also have a consistent place to go for health care, such as an urgent care clinic or a primary care physician.
Occupation Data
Most respondents work between 35 and 40 hours per week.
A majority of respondents are working at a job or business, followed by a sizable proportion who are out of work.
Social Meal Support